Organizing Text for Analysis

Published by gail_massari

Video was updated in August 2024. Q&A is included after time 30:00.

Text Explorer is useful for exploring unstructured text, such as survey free-response fields or incident reports, to understand its meaning. This is often an iterative process, where you alternate between curating and analyzing a list of terms and phrases.

Two case studies (Europe Vacation and Which Wine?) show how to:

Use the Text Explorer input menu to start exploring text
Understand data used in Hotel Review case study
Understand and refine Term and Phrase Lists and Word Cloud
Use Data Filter to explore Word Cloud results
Use JMP Pro for Sentiment Analysis and Term Selection
Use Multivariate Embedding to graph text results into two or three groups to understand related results

JMP Pro users may want to use the Torch Deep Learning Add-In to extend text analysis capabilities.

Kemal @kemal_oflus, Olivia @O_Lippincott and Ross @Ross_Metusalem answered questions during the live webinars:

Q: How do I see the Text Explorer menu?

A: From File>Preferences, click the Menu group on the left and check whether “Advanced Modeling” is selected. If not, check the box and click OK. The Menu preferences let you display or hide menu items to customize your JMP interface.

Q: Is it possible to set a user preference to always default to a centered layout for the word cloud?

A: Yes, under File>Preferences you can set specific platform preferences. Under the Text Explorer platform in preferences an option exists to change the default layout to centered for the word cloud.

Q: Could you please explain how to properly set up the binary column and how it is used for analytics?

A: The Document Term Matrix gives a binary column for each term in the term list. This turns the text into a numerical dataset that can be used in modeling or graphing. Also see the Q&A in the video at time ~ 32:20 for more details.
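
To make the idea of a binary column per term concrete, here is a minimal Python sketch. It is not JMP code and does not reproduce JMP's tokenizer; the review text, the term list, and the `binary_dtm` helper are all hypothetical and only illustrate how text becomes a 0/1 table.

```python
import re

def binary_dtm(documents, terms):
    """Return one row per document with a 0/1 column for each term."""
    rows = []
    for text in documents:
        # Simplified tokenizer (an assumption): lowercase runs of letters/apostrophes.
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        rows.append([1 if term in tokens else 0 for term in terms])
    return rows

# Hypothetical review text, not taken from the webinar data.
reviews = [
    "Clean room and friendly staff",
    "The room was noisy",
    "Friendly staff, great breakfast",
]
terms = ["room", "staff", "noisy", "breakfast"]

dtm = binary_dtm(reviews, terms)
for row in dtm:
    print(row)
```

Each row is a document and each column answers "does this document contain the term?", which is what lets the text feed into modeling or graphing.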

Q: What is Local in Document Term Matrix?

A: Local uses stop words only in the current Text Explorer platform. If you would like to use the stop words with another dataset or another launch of the platform with the current table you can create a user library with the user option. Also see the Q&A in the video at time ~ 32:20 for more details.

Q: Can I use RegEx to explore text?

A: Yes. A blog post on data cleaning gives some information, as does this Discovery Summit presentation.

Q: The presenter used binary weighting in the initial text analysis, but used TF IDF for the SVD analysis. Should you stick with one weighting throughout the analysis, or is it helpful to use different weightings for different portions of the analysis?

A: TF IDF is the default weighting for an SVD analysis. Binary is the default unless an SVD analysis has been run. The other weighting options have specific use cases. Also see the Q&A in the video at time ~ 32:20 for more details.
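
To show how the two weightings differ on the same documents, here is a small Python sketch. It assumes a common textbook form of TF IDF (term count times log10 of total documents over documents containing the term); JMP's exact formula may differ, and the `weight_matrix` helper and sample tokens are hypothetical.

```python
import math
from collections import Counter

def weight_matrix(token_lists, terms, weighting="binary"):
    """Build a document-term matrix using binary or TF-IDF weighting."""
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each term.
    doc_freq = {t: sum(1 for toks in token_lists if t in toks) for t in terms}
    matrix = []
    for toks in token_lists:
        counts = Counter(toks)
        row = []
        for t in terms:
            tf = counts[t]
            if weighting == "binary":
                row.append(1.0 if tf else 0.0)
            elif weighting == "tfidf":
                # Assumed textbook form; a term found in every document gets weight 0.
                idf = math.log10(n_docs / doc_freq[t]) if doc_freq[t] else 0.0
                row.append(tf * idf)
            else:
                raise ValueError(f"unknown weighting: {weighting}")
        matrix.append(row)
    return matrix

# Hypothetical tokenized documents.
docs = [["room", "clean", "room"], ["room", "noisy"], ["staff", "friendly"]]
terms = ["room", "noisy", "staff"]

binary = weight_matrix(docs, terms, "binary")
tfidf = weight_matrix(docs, terms, "tfidf")
print(binary)
print(tfidf)
```

Binary only records presence, while TF IDF down-weights terms that appear in many documents, which is one reason it can suit a decomposition such as SVD.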

Q: Have you considered adding a feature for "coded" text/cryptography in order to help recognize and analyze patterns in coded text?

A: Some of this can be addressed by using custom Regex. There is an option for this on that initial launch window. There, you can teach JMP how to identify many patterns and what to do when it finds them.
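
As an illustration of teaching a tokenizer to recognize a pattern, the Python sketch below keeps a made-up code format (two uppercase letters, a hyphen, four digits) intact as one token. The pattern, the sample sentence, and the `tokenize` helper are hypothetical; this uses Python's `re` module and is not the JMP Regex editor.

```python
import re

# Hypothetical code format: two uppercase letters, a hyphen, four digits (e.g. "QX-4417").
CODE_PATTERN = r"[A-Z]{2}-\d{4}"
WORD_PATTERN = r"[A-Za-z]+"

# Alternation order matters: the code pattern is tried before the plain-word pattern.
TOKENIZER = re.compile(f"{CODE_PATTERN}|{WORD_PATTERN}")

def tokenize(text):
    """Return tokens, keeping codes as written and lowercasing ordinary words."""
    tokens = []
    for match in TOKENIZER.finditer(text):
        token = match.group()
        if re.fullmatch(CODE_PATTERN, token):
            tokens.append(token)          # keep the code as a single, unchanged token
        else:
            tokens.append(token.lower())  # ordinary words are lowercased
    return tokens

print(tokenize("Unit QX-4417 failed after swap with QX-4420"))
```

Without the code pattern, a basic word tokenizer would split "QX-4417" into pieces and the code would not show up as a term.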

Q: Can you do a word cloud for the phrases instead of the words?

A: The word cloud is using the term list to generate the word cloud. If you add phrases to the term list like Kemal did in the example, the word cloud will also show the phrases.
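
The effect of adding a phrase to the term list can be sketched in Python as follows. This is only an illustration under the assumption that a phrase is counted as one term wherever its words appear in sequence; the `merge_phrases` helper and sample sentence are hypothetical and JMP's internal handling may differ.

```python
from collections import Counter

def merge_phrases(tokens, phrases):
    """Replace each occurrence of a listed phrase with a single multi-word term."""
    # Longest phrases first so "front desk staff" would win over "front desk".
    phrase_tuples = sorted(
        (tuple(p.split()) for p in phrases if p.split()), key=len, reverse=True
    )
    merged, i = [], 0
    while i < len(tokens):
        for phrase in phrase_tuples:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                merged.append(" ".join(phrase))
                i += len(phrase)
                break
        else:
            # No phrase starts here, so keep the single word.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "the front desk staff were helpful at the front desk".split()
merged = merge_phrases(tokens, ["front desk"])
counts = Counter(merged)
print(merged)
print(counts["front desk"])
```

Once "front desk" is a term in its own right, anything built from term counts, such as a word cloud, can display it alongside single words.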

Q: Do you find that you change the default text explorer options (maximum characters per word, stemming, maximum words per phrase)? I usually leave these alone and wonder if it is important to change them.

A: In most cases I leave them alone except for minimum characters per word. Subject matter experts may have input here. If you have some unique terms, you can also customize a Regular Expression to handle them. See video below.

Q: Is Sentiment Analysis only in JMP Pro?

A: Yes, see more about Sentiment Analysis using JMP Pro.

Q: Do you have any tips for working with custom Regular Expressions in JMP for text analysis? I find that I spend a lot of time trying to do this and get limited results.

A: We offer a course on Text Analysis that you might find useful. It includes various examples of how to use Regex. A JMP blog post about using Regex for cleaning data will also be posted to the JMP Community Blog.

Q: t-SNE plot: what do the x-y numbers mean?

A: For that I would recommend watching a video on multivariate method clustering or on teaching clustering. In this case, we had 25 different columns and reduced that multi-dimensional space down to 2. Think of this as similar to principal components. You can choose how many outputs (or components) you want.

[Image: Multivariate Output Options]

Q: Sometimes there is red text in the phrase list. What does that signify?

A: It identifies a default color in our library.

[Image: Phrase Colors]

Questions answered at a previous demo on this topic:

Q: Do you have to pre-configure the Word cloud to ignore super common words, like "the"?

A: JMP has a built-in list of "stop words," very common words like "the," that it excludes from the analysis. See documentation on using stop words and related topics.
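
The idea of excluding stop words before counting terms can be sketched in Python. The stop word set here is a tiny illustrative list, not JMP's built-in list, and `term_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Illustrative stop words only; JMP ships its own, much longer list.
STOP_WORDS = {"the", "a", "and", "was", "is", "of"}

def term_counts(documents, stop_words=STOP_WORDS):
    """Count terms across documents, skipping anything in the stop word set."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in stop_words:
                counts[token] += 1
    return counts

docs = ["The room was clean", "The staff and the room"]
counts = term_counts(docs)
print(counts.most_common())
```

Because "the" never reaches the counter, it cannot dominate a frequency-based display such as a word cloud.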

Q: Can we import delimited txt files and analyze columns individually?

A: JMP can import delimited text files into a JMP data table. You could analyze multiple text columns individually which would give a separate output for each text column.
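
Outside JMP, the same idea (one delimited file, a separate analysis per text column) can be sketched with Python's standard `csv` module. The file contents, column names, and `column_term_counts` helper are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical tab-delimited file contents with two free-text columns.
RAW = (
    "id\tpositive\tnegative\n"
    "1\tGreat staff\tNoisy room\n"
    "2\tGreat view\tSmall room\n"
)

def column_term_counts(raw_text, text_columns, delimiter="\t"):
    """Return a separate term counter for each named text column."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    counts = {col: Counter() for col in text_columns}
    for row in reader:
        for col in text_columns:
            counts[col].update(re.findall(r"[a-z']+", row[col].lower()))
    return counts

result = column_term_counts(RAW, ["positive", "negative"])
for col, counter in result.items():
    print(col, counter.most_common())
```

Keeping one counter per column mirrors getting a separate output for each text column.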

Q: In his example, if red color shows the most frequent word, why isn't the biggest word red?

A: Color and size show different things. The size of a word/term reflects how frequent it is. The coloring is based on the Average score column: in this example, Kemal specified that the word cloud coloring should be controlled by the Average score column from the data table. So Average_score doesn’t come from JMP, but from the data table. See more about Word Cloud options.
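
A short Python sketch can show why the biggest word need not have the most extreme color. It assumes size is driven by how many documents contain a term and color by the mean of a score column over those documents; the rows, scores, and `term_size_and_color` helper are hypothetical, and JMP's exact computation may differ.

```python
import re
from collections import defaultdict

def term_size_and_color(rows):
    """rows: iterable of (text, score). Returns {term: (doc_count, mean_score)}."""
    doc_count = defaultdict(int)
    score_sum = defaultdict(float)
    for text, score in rows:
        # Use a set so each document counts a term at most once.
        for term in set(re.findall(r"[a-z']+", text.lower())):
            doc_count[term] += 1
            score_sum[term] += score
    return {t: (doc_count[t], score_sum[t] / doc_count[t]) for t in doc_count}

# Hypothetical reviews with a score column.
rows = [
    ("great location", 9.0),
    ("location was noisy", 5.0),
    ("great breakfast", 8.0),
    ("noisy street, poor location", 3.0),
]

stats = term_size_and_color(rows)
for term, (count, mean_score) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
    print(f"{term:10s} size={count} color={mean_score:.2f}")
```

Here "location" is the most frequent term, so it would be drawn largest, but its mean score sits between "great" and "noisy", so it would not get the most extreme color.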

Q: What are tokens?

A: Tokens are Words/Terms plus Phrases/Cases or a combination of those. In this case I had 15792 Terms, 59360 Cases (or Phrases) and about 17 of these per row (Total Tokens per Case). See more details.

The Tokenizing stage converts text to lowercase, then applies a ‘Tokenizing’ method (either Basic Words or Regex) to group characters into tokens, and then it recodes tokens based on specified recode definitions. Note that recoding occurs before stemming and recode operations are processed internally in one pass regardless of the order that they are specified in the report window. See more about the Text Processing Steps.
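
The order of these steps can be sketched in Python. This is an illustration only: the recode table is hypothetical, the tokenizer is a simple regex, and `toy_stem` is a crude stand-in for real stemming, not JMP's stemmer.

```python
import re

# Hypothetical recode definitions (original token -> replacement).
RECODES = {"rooms": "room", "brekkie": "breakfast"}

def toy_stem(token):
    """Crude suffix stripping used only to show where stemming sits in the order."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def process(text, recodes=RECODES):
    lowered = text.lower()                          # 1. convert to lowercase
    tokens = re.findall(r"[a-z']+", lowered)        # 2. group characters into tokens
    recoded = [recodes.get(t, t) for t in tokens]   # 3. recode in a single pass
    return [toy_stem(t) for t in recoded]           # 4. stem after recoding

print(process("Rooms were cleaned; brekkie was amazing"))
```

Because recoding is one dictionary lookup per token, a recoded value is not looked up again, which mirrors the single-pass behavior described above.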

[Image: Tokens]

Q: Can you make a Phrase Cloud rather than a Word Cloud?

A: We contacted JMP Developer, Xan Gregg @XanGregg, after the live webinar. He suggested that the Word Cloud might more accurately be called a Term Cloud. It shows all the words in the Term list. I suppose you could add a lot of Phrases as Terms and remove words by making them stop words, and then you would get a cloud of Phrases. If you are interested in requesting Phrase Clouds, feel free to post your request and rationale to the JMP Software Wish List.

Q: I am really interested in the "document clustering" you mentioned earlier. Where can I find information on this? I have hundreds of text documents and want to find which ones fall into similar groups.

A: You can use JMP Pro for Latent Class Analysis and Latent Semantic Analysis.

Latent Analysis Options

See this brief overview of how to use Latent Analyses.

Q: What are documents and how do you import them into JMP?

A: Each row is considered a document, and you can import documents from any file that captures them in rows. For example, you can import the text from Excel. In this example, the reviews were captured as rows in an Excel file.

Resources

Overview of Text Explorer (Hint: Work through the documentation pages as a quasi-tutorial.)
Documentation on Predictor Screening
Video on Using Text Explorer to Extend Analysis

Start:

Fri, Aug 16, 2024 02:00 PM EDT

End:

Fri, Aug 16, 2024 03:00 PM EDT
